Fine-Tune Qwen2.5-3B with DPO & Unsloth

Fine-Tuning Qwen2.5-3B with DPO using Unsloth on Preference TinyStories Dataset

Finetuning
DPO
Unsloth
Qwen
Author

Quang T. Duong

Published

August 26, 2024

Supervised fine-tuning for large language models (LLMs) enables precise adaptation of a model to specific tasks or domains, significantly improving its performance by providing labeled data tailored to desired outputs. This technique is particularly beneficial in scenarios where off-the-shelf LLMs fail to meet domain-specific or task-specific requirements, such as legal document summarization or medical diagnostics, where accuracy and relevance are critical.

However, supervised fine-tuning alone may not address alignment with human preferences, especially in open-ended or subjective tasks. This is where Reinforcement Learning from Human Feedback (RLHF) techniques come in handy: they align the model’s outputs with user preferences by incorporating feedback on what users value most, so that the LLM generates responses that are not only accurate but also aligned with human expectations and ethical considerations. These techniques are often referred to as fine-tuning with preference alignment.

This post focuses on Direct Preference Optimization (DPO), a fine-tuning technique that aligns LLMs with human preferences by directly optimizing the model’s outputs based on human feedback.

Preference Alignment with DPO

DPO was introduced in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model by R. Rafailov et al. in 2023.

Unlike Proximal Policy Optimization (PPO), a reinforcement learning algorithm that requires training a separate reward model and iterative sampling, DPO simplifies the process by using a supervised learning framework to adjust the model’s behavior according to ranked human preferences.

The principle of DPO is to collect human preferences on model outputs, typically as pairs of a preferred (chosen) and a dispreferred (rejected) response to the same prompt, and to use a binary cross-entropy objective to steer the model towards the preferred responses.
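To make the objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes we already have the summed log-probabilities of the chosen and rejected responses under the model being trained (the policy) and under a frozen reference model. The function name and the numeric values are hypothetical and only serve as an illustration; in practice the trainer computes these quantities for us.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # How much more (or less) likely each response is under the policy than under the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probabilities, for illustration only
loss_at_start = dpo_loss(-40.0, -42.0, -40.0, -42.0)  # policy identical to reference
loss_improved = dpo_loss(-38.0, -45.0, -40.0, -42.0)  # policy prefers the chosen response more
print(loss_at_start, loss_improved)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss decreases as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference.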

This method offers several benefits: it simplifies the training process, reduces computational requirements, and can lead to faster alignment with human preferences. Because the preference pairs explicitly mark undesirable responses, it may also help mitigate biases inherited from the training data, although this depends on the quality of the preference data.

In the next section, we will work through a use-case in which we apply DPO to fine-tune a base LLM for a specific text-generation task so that its outputs align with human preferences.

Use-case

In the previous post, Fine-tune Qwen2.5-3B using Lora with Unsloth, we fine-tuned the base model Qwen2.5-3B on the instruction dataset TinyStories_Instruction using LoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique, to obtain a children’s story generator that responds to a story instruction request.

Continuing this use-case, in this post we again fine-tune the base model Qwen2.5-3B, this time using DPO. To this end, we use the preference dataset TinyStories_Preference, which we created and discussed in the previous post Create Preference Dataset for DPO Fine-Tuning.

Finally, we will compare stories generated by the two methods to evaluate their performance.

Let’s jump into the implementation part.

Implementation

Install and import required packages

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install -q comet_ml
# Patch TRL's DPOTrainer with Unsloth's optimizations; do this before importing trl
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

import os
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import DPOConfig, DPOTrainer
from google.colab import userdata
import comet_ml

Initialize Comet ML for Experiment Tracking

Set up Comet ML to log experiments, track training metrics, and monitor performance:

comet_ml.login(project_name="dpo-lora-unsloth")

Load Pretrained Model and Tokenizer

Load the pretrained Qwen model and tokenizer from Hugging Face. Specify the maximum sequence length and decide whether to load the model in 4-bit precision for efficiency: load_in_4bit=True corresponds to QLoRA, while load_in_4bit=False corresponds to standard LoRA.

max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=max_seq_length,
    load_in_4bit=False,
    )
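To get a feeling for what the load_in_4bit switch means for memory, here is a rough back-of-the-envelope sketch. It assumes about 3 billion parameters (a rounded figure for illustration) and counts the weights only; activations, optimizer state, LoRA adapters, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3e9  # assumption: ~3 billion parameters, rounded for illustration

mem_16bit = weight_memory_gb(NUM_PARAMS, 16)  # load_in_4bit=False (16-bit weights)
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)    # load_in_4bit=True (4-bit weights)

print(f"16-bit weights: ~{mem_16bit:.1f} GB")
print(f" 4-bit weights: ~{mem_4bit:.1f} GB")
```

Under these assumptions, 4-bit loading cuts the weight footprint to roughly a quarter, which is the main reason to choose QLoRA on small GPUs.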

Apply LoRA Adaptation

LoRA freezes the pretrained weights and adds small trainable low-rank matrices to selected layers, e.g. “q_proj”, “k_proj”, “v_proj”, etc., which significantly reduces memory and computation requirements:

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    )
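As a rough illustration of why LoRA is cheap, the sketch below counts the trainable parameters LoRA adds to a single linear layer. The layer size is hypothetical and chosen only for the arithmetic; the rank and alpha match the values used above. It also assumes the standard LoRA scaling of alpha / r.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in).

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)

# Hypothetical layer size, for illustration only
d_in, d_out = 2048, 2048
r, lora_alpha = 32, 32

full_params = d_in * d_out                       # frozen weights of the layer
added_params = lora_param_count(d_in, d_out, r)  # trainable LoRA weights
scaling = lora_alpha / r                         # factor applied to the update B @ A

print(f"frozen: {full_params}, trainable: {added_params}, "
      f"ratio: {added_params / full_params:.4f}, scaling: {scaling}")
```

For this hypothetical layer, the adapter trains about 3% as many parameters as the frozen weight matrix, and with r equal to lora_alpha the update is applied with a scaling of 1.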

Dataset Preparation

Format the dataset with a specific Alpaca-like template and append the EOS token to chosen and rejected samples. Then split the dataset into training and testing sets:

alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
"""

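# Load the preference dataset first (this step is not shown in the original listing).
# The ID below is a placeholder: replace it with the Hub ID of TinyStories_Preference.
# split="train" is an assumption so that train_test_split can be called on the result.
dataset = load_dataset("<TinyStories_Preference-hub-id>", split="train")

EOS_TOKEN = tokenizer.eos_token

def format_samples(example):
    # Wrap the prompt in the template and terminate both completions with EOS
    return {
        "prompt": alpaca_template.format(example["prompt"]),
        "chosen": example["chosen"] + EOS_TOKEN,
        "rejected": example["rejected"] + EOS_TOKEN,
    }

dataset = dataset.map(format_samples)
# Hold out 5% of the preference pairs for evaluation
dataset = dataset.train_test_split(test_size=0.05)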
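To see what a single formatted record looks like, here is a small standalone sketch of the same formatting step. The record is made up for illustration and is not taken from TinyStories_Preference, and the EOS string is a stand-in for tokenizer.eos_token.

```python
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
"""

EOS_TOKEN = "<eos>"  # stand-in; the real value comes from tokenizer.eos_token

def format_samples(example):
    # Same logic as above: templated prompt, EOS-terminated completions
    return {
        "prompt": alpaca_template.format(example["prompt"]),
        "chosen": example["chosen"] + EOS_TOKEN,
        "rejected": example["rejected"] + EOS_TOKEN,
    }

# Made-up record, for illustration only
toy = {
    "prompt": "Write a short story about a kind fox.",
    "chosen": "Once upon a time, a kind fox shared her berries with a hungry bird.",
    "rejected": "Fox. Berries. The end.",
}

formatted = format_samples(toy)
print(formatted["prompt"] + formatted["chosen"])
```

The prompt ends right after the "### Response:" line, so the chosen and rejected texts are pure continuations of the same prompt. The EOS token teaches the model where a story should stop.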

Training Using DPOTrainer

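Configure the training process with the DPOTrainer class. Define hyperparameters such as the learning rate, batch size, gradient accumulation steps, and number of epochs, and enable Comet ML for logging. The beta parameter controls how strongly the policy is kept close to the reference model. With ref_model=None and a LoRA-wrapped model, TRL uses the base model with the adapters disabled as the reference, so no second copy of the model is needed: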

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    tokenizer=tokenizer,
    beta=0.5,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_length=max_seq_length//2,
    max_prompt_length=max_seq_length//2,
    args=DPOConfig(
        learning_rate=2e-6,
        lr_scheduler_type="linear",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        eval_strategy="steps",
        eval_steps=0.2,
        logging_steps=1,
        report_to="comet_ml",
        seed=0,
        ),
)


trainer.train()
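To relate these settings to the actual number of optimizer steps, here is a small arithmetic sketch. The dataset size is hypothetical, a single GPU is assumed, and the rounding is an approximation of what the trainer does (it may differ slightly between library versions). A fractional eval_steps such as 0.2 is interpreted as a ratio of the total training steps.

```python
import math

def training_schedule(num_train_samples, per_device_batch_size, grad_accum_steps,
                      num_epochs=1, num_devices=1, eval_ratio=0.2):
    """Approximate effective batch size, total optimizer steps, and eval interval."""
    # Samples consumed per optimizer step
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_train_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    # A ratio below 1 is converted into a number of steps
    eval_every = math.ceil(total_steps * eval_ratio)
    return effective_batch, total_steps, eval_every

# Hypothetical training-set size, for illustration only
effective_batch, total_steps, eval_every = training_schedule(
    num_train_samples=1600, per_device_batch_size=2, grad_accum_steps=8
)
print(effective_batch, total_steps, eval_every)
```

With the settings above, each optimizer step sees 2 x 8 = 16 preference pairs, and evaluation runs about five times over the course of training.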

Model Inference

Once training has finished, let’s test our fine-tuned model. Inference consists of formatting the input prompt with the same template used for training and using a text streamer to print tokens as they are generated:

# Switch Unsloth to its inference mode
FastLanguageModel.for_inference(model)
# Build the prompt with the same template used during training
message = alpaca_template.format("Write a story about a humble little bunny \
named Ben who follows a mysterious trail in the woods, \
discovering beautiful flowers, new friends, and a lovely pond along the way.")
inputs = tokenizer([message], return_tensors="pt").to("cuda")
# Stream tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)
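The generated sequence contains the prompt followed by the story. If we decode it ourselves instead of streaming it, we may want to keep only the story. Below is a minimal sketch of such a post-processing step; extract_response is a hypothetical helper, and "<eos>" is a stand-in for the tokenizer’s real EOS string.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded_text, eos_token="<eos>"):
    """Return only the text after the response marker, without the EOS token."""
    before, marker, after = decoded_text.partition(RESPONSE_MARKER)
    # If the marker is missing, fall back to the whole text
    response = after if marker else before
    return response.replace(eos_token, "").strip()

# Made-up decoded output, for illustration only
decoded = (
    "Below is an instruction that describes a task.\n"
    "### Instruction:\nTell a story.\n"
    "### Response:\nOnce upon a time.<eos>"
)
story = extract_response(decoded)
print(story)
```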

The generated output is shown below:

Once upon a time, there was a little bunny named Ben. Ben lived in a cozy little house in the woods, surrounded by tall trees and a beautiful meadow. He loved to play and explore, but he always felt a little bit lost and alone.

One day, Ben decided to take a walk in the woods. He was feeling a little adventurous and wanted to see what he could find. As he walked, he noticed a trail that led deeper into the woods. He followed the trail, and soon he found himself in a beautiful meadow filled with colorful flowers.

As Ben walked through the meadow, he met a little bird named Lily. Lily was a cheerful bird who loved to sing and dance. She told Ben about all the different flowers she had seen in the meadow, and they spent the rest of the day playing and having fun together.

After a while, Ben and Lily decided to take a break and rest in a cozy little clearing. As they sat there, they heard a gentle sound coming from the pond. They followed the sound and found a lovely pond with clear, blue water. They sat by the pond and watched the fish swim around, and they even saw a little frog hop by.

As the sun began to set, Ben and Lily decided to head back home. They said goodbye to each other and promised to meet again soon. Ben felt a little bit sad, but he knew that he had made some new friends and had discovered some beautiful things in the woods.

From that day on, Ben made a habit of exploring the woods every day. He would follow the mysterious trail and discover new things, new friends, and new adventures. And he knew that no matter where he went, he would always have a little bit of magic in his heart.

Let’s compare it with the story generated by the instruction-tuned model from the previous post:

Once upon a time, there was a humble little bunny named Ben. Ben loved to hop around in the meadow, eating carrots and playing with his friends. One day, Ben saw a mysterious trail in the woods. He was curious and wanted to follow it.

Ben hopped along the trail, hopping faster and faster. He saw beautiful flowers and new friends along the way. Suddenly, he came to a big pond. He hopped into the water and splashed around. It was so much fun!

Ben continued to follow the mysterious trail, hopping and splashing until he couldn't see the end. He was happy to have had such a wonderful adventure. From that day on, Ben knew that he could always follow a mysterious trail and have fun.

In this example, the story generated after preference alignment with DPO is longer, more detailed, and arguably more engaging than the one generated by the model that was supervised fine-tuned with LoRA on the instruction dataset.

Save and Push to Hugging Face Hub

Once we are satisfied with the training results, let’s save the fine-tuned model in 16-bit merged format and push it to the Hugging Face Hub for easy sharing and deployment:

from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("tanquangduong/Qwen2.5-3B-DPO-TinyStories", tokenizer, save_method="merged_16bit")

Conclusion

In this article, we have discussed the need for preference alignment in fine-tuning LLMs. While traditional fine-tuning enhances model accuracy for structured tasks, preference alignment techniques like DPO extend the capabilities of LLMs to align with human values and preferences.

The implementation provided in this post demonstrates how to apply DPO to fine-tune the Qwen2.5-3B model using a preference dataset. This workflow steers the model toward outputs that are not only accurate but also better aligned with the preferences expressed in that dataset.